ABSTRACT

CTCF is a highly conserved transcriptional regulator 
protein that performs diverse functions such as 
regulating gene expression and organizing the 3D 
structure of the genome. Here, we describe recent 
updates to a database of CTCF-binding sites, 
CTCFBSDB (http://insulatordb.uthsc.edu/), which 
now contains almost 15 million CTCF-binding 
sequences in 10 species. Since the original publica- 
tion of the database, studies of the 3D structure of 
the genome, such as those provided by Hi-C experi- 
ments, have suggested that CTCF plays an import- 
ant role in mediating intra- and inter-chromosomal 
interactions. To reflect this important progress, we 
have integrated CTCF-binding sites with genomic 
topological domains defined using Hi-C data. 
Additionally, the updated database includes new 
features enabled by new CTCF-binding site data, 
including binding site occupancy and the ability to 
visualize overlapping CTCF-binding sites deter- 
mined in separate experiments. 

INTRODUCTION 

The CCCTC-binding factor, CTCF, is a ubiquitously 
expressed transcriptional regulator protein that is highly 
conserved from fly to man (1,2). It was first identified as a 
transcriptional repressor of the MYC oncogene (3,4) and, 
subsequently, has been shown to be involved in an extraordinarily 
diverse set of regulatory functions including 
transcriptional activation, imprinting, X-chromosome 
inactivation and acting as an enhancer-blocking and/or 
barrier insulator-binding protein (2). A few years ago, 
several groups attempted to better characterize CTCF 
function by identifying human and mouse CTCF- 
binding sites genome wide using both experimental and 
computational methods (5-8). These studies focused on 
CTCF's role as an insulator-binding protein, finding that 
CTCF-binding sites were detected between active and 
silent chromatin domains (7) and that the expression of 
neighboring genes separated by predicted CTCF-binding 
sites is less correlated than random sets of neighboring 
genes (6). Additionally, these datasets of CTCF-binding 
sites were used to establish that, while many functional 
CTCF-binding sites do not match a consensus motif (9), 
there is a CTCF-binding site motif that is highly conserved 
in vertebrates (5). Initial consensus CTCF-binding site 
motifs were then used to computationally predict CTCF- 
binding sites (5,6). Within this context, we introduced the 
first public database of CTCF-binding sites, CTCFBSDB, 
in 2007 (10). The initial version of CTCFBSDB contained 
34 420 experimental and 18 905 predicted CTCF-binding 
sequences and integrated these sites with functional anno- 
tations and gene expression profiles to examine how the 
binding sites may provide insulator function. 

Since the introduction of CTCFBSDB, there have been 
many significant developments in understanding the role 
of CTCF. To a large extent, these developments have 
focused on how CTCF functions as the 'master weaver' 
of the genome by establishing the long-range intra- and 
inter-chromosomal contacts between chromatin fibers that 
organize the genome in three dimensions (2,9). In addition 
to CTCF being responsible for long-range interactions at 
specific loci such as β-globin, H19 ICR and MHC-II (2), 
CTCF-binding sites have been connected to several key 
observations from Hi-C experiments that provide 
genome-wide 3D maps of chromatin interactions (11,12). 
Specifically, CTCF-binding sites were found to be signifi- 
cantly overrepresented both on Hi-C fragments that had a 
large number of long-range interactions (13) and at the 
boundaries of the topological domains that spatially 
organize the genome (12). In parallel with this changing 
understanding of the importance of CTCF, there has been 
remarkable growth in the number of experimentally 
identified CTCF-binding sites. These new binding sites 
have been used to investigate the mechanism through 
which CTCF binds to DNA sequences, resulting in the 
identification of multi-part sequence motifs that bind to 
CTCF (14-16) and the suggestion that the degree of 
occupancy at a binding site may be related to the 
binding type and function at the site (17). In this article, 
we discuss improvements to the CTCFBSDB that reflect 
this recent progress in understanding the function of 
CTCF. 



NEW FEATURES 

In addition to the significant expansion in the number of 
binding sequences available in the database which will be 
discussed in the next section, we have modified the pres- 
entation of binding sites in the CTCFBSDB (Figure 1) to 
include several new features: 

(i) Inclusion of genomic topological domains defined 
using Hi-C data: the boundaries of these domains 
are enriched for CTCF-binding sites (12). We 
calculated the distance from each CTCF-binding 
sequence to the nearest domain boundary to help 
identify binding sites that may function to 
organize these domains. We also allow users to 
browse the topological domains to identify CTCF- 
binding sites at the boundaries of specific domains. 

(ii) Identification of CTCF-binding sequences that over- 
lap a given CTCF-binding sequence: the database 
now contains CTCF-binding sites identified in 
many tissues and cell types in mice and humans, 
making it possible to investigate whether CTCF binding 
is specific to a particular cell type or conserved and, 
potentially, to narrow the location of a binding site 
to a smaller range. 

(iii) Inclusion of occupancy data: we display the occu- 
pancy of the CTCF-binding site, when available. 
CTCF-binding site occupancy has been used to in- 
vestigate both the potential for buffering of poly- 
morphisms within binding sites (18) and how the 
CTCF-binding motif changes depending on the 
occupancy (17). 

(iv) Classification of motif match type: recent analysis of 
the conservation of CTCF-binding sites across vertebrates 
has found that CTCF binding at many sites 
can be understood in terms of a two-part motif in 
which each part interacts with distinct CTCF zinc 
fingers (16). We classify the CTCF-binding 
sequences based on how they match these motifs, 
allowing users to investigate the types of interactions 
that take place in the binding event. 

(v) Integration with Genome Browser: the sequence 
context of each binding site in CTCFBSDB, 
including polymorphisms and DNA-methylation 
sites within the binding sequences, can be visualized 
in a Genome Browser viewer (19). Overlapping 
CTCF-binding sequences and topological domains 
are also displayed, facilitating the use of these new 
features. 

Figure 1. Screenshot of an example webpage for a CTCF-binding sequence (ENCODE_OC_hg18_MCF-7_744758) in CTCFBSDB 2.0. The 
database provides a description of the binding site, where the binding sequence is located within topological domains, and a Genome Browser 
viewer showing the genomic context of the binding site. Users also have the option to display the expression of genes flanking the binding site and 
CTCF-binding sequences that overlap the sequence. This CTCF-binding sequence, which was identified in MCF-7 cells, overlaps binding sequences 
that were identified in four other cell types. 



DATABASE CONTENT 

Sources of CTCF-binding sites 

We expanded the CTCFBSDB using data from a 
variety of sources containing CTCF-binding sites 
determined using genome-wide experimental methods, 
and CTCFBSDB now contains 14 735 367 experimentally 
determined CTCF-binding sequences, including 
13 760 124 human binding sequences and 821 858 mouse 
binding sequences. For human and mouse, the database 
contains CTCF-binding sequences identified in many cell 
types and experiments. Therefore, these sequences may 
include binding sites that have been repeatedly found in 
different cell types and experiments. We grouped 
overlapping binding sequences into CTCF-binding 
sequence clusters and identified 433 747 and 149 141 
clusters in human (hg19) and mouse (mm9), respectively. 
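
The grouping of overlapping binding sequences into clusters can be illustrated with a sort-and-merge pass over genomic intervals. The following is a minimal sketch only: the `(chrom, start, end)` tuple layout, the function name `cluster_sites` and the rule that two sequences overlap when they share at least one base are assumptions made for illustration, not a description of how the database clusters were built.

```python
from collections import defaultdict

def cluster_sites(sites):
    """Group overlapping (chrom, start, end) intervals into clusters.

    Coordinates are assumed to be 0-based and half-open; two intervals
    are treated as overlapping when they share at least one base.
    Returns a list of (chrom, start, end, n_members) tuples.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in sites:
        by_chrom[chrom].append((start, end))

    clusters = []
    for chrom in sorted(by_chrom):
        cur_start = cur_end = None
        count = 0
        # Sorting by start lets one left-to-right pass find every cluster.
        for start, end in sorted(by_chrom[chrom]):
            if cur_end is not None and start < cur_end:
                # Overlaps the open cluster: extend it.
                cur_end = max(cur_end, end)
                count += 1
            else:
                # Close the previous cluster (if any) and open a new one.
                if cur_end is not None:
                    clusters.append((chrom, cur_start, cur_end, count))
                cur_start, cur_end, count = start, end, 1
        if cur_end is not None:
            clusters.append((chrom, cur_start, cur_end, count))
    return clusters
```

Under these assumptions, sequences that merely abut (one ends where the next begins) are kept in separate clusters; a different overlap threshold would change the cluster counts.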

The sources of binding sites collected in the database 
include six published articles that utilized ChIP-Seq 
(16,20-24) and two articles using new ChIP-exo (15) and 
ChIA-PET (25) methods that have provided tens of thousands 
of CTCF-binding sequences in each of seven species 
(human, macaque, mouse, rat, dog, opossum and 
chicken). Additionally, we collected 145 human and 18 
mouse CTCF-binding site datasets identified by the 
ENCODE project (26,27) that were publicly available as 
of 30 June 2012. Each CTCF-binding sequence in the 
database is identified by a prefix containing information 
about the data source appended to a number, creating a 
unique identifier for each binding sequence. For binding 
site datasets from ENCODE, the cell type and experimen- 
tal treatment, if specified, were added to the end of the 
identifier prefix. A table containing a complete listing of 
the sources of the data in CTCFBSDB and the binding 
sequence identifier prefixes is provided on the database 
website 'Help' page. 

CTCF-binding sites at topological domain boundaries 

As technological advancements have enabled the study of 
how the genome is packaged into the nuclei of eukaryotes, 
they have consistently confirmed that there are strong 
links between the spatial organization of the genome 
and biological function (2,9). One of the most significant 
new experimental techniques that investigate the 3D structure 
of the genome is Hi-C, which was first applied to 
create a genome-wide map of chromatin interactions in 
a human lymphoblastoid cell line (11), and more 
recently, has been used to study mouse and human 
embryonic stem (ES) cells, mouse cortex and human 
IMR90 fibroblasts (12). A primary result of this latter 
study is that the genome is organized into megabase-sized 
topological domains that occur throughout the 
genome and are conserved across different cell types and 
between mouse and human. Local chromatin interactions 
within a topological domain are common, while inter- 
actions between different domains or with boundary 
regions that separate domains are comparatively rare. 
While only 15% of CTCF-binding sites were located 
within boundary regions, there was a significant enrich- 
ment of CTCF-binding sites at domain boundaries (12), 
adding to the evidence that CTCF plays an important role 
in higher order genome organization. 

To integrate CTCF-binding sites with topological 
domains, we downloaded 7947 human (hg18) and 8937 
mouse (mm9) topological domains from the project website 
(http://chromosome.sdsc.edu/mouse/hi-c/download.html) 
of recent Hi-C experiments. The topological 
domains included in CTCFBSDB were determined for a 
bin size of 40 kb combined across multiple replicates of 
Hi-C interactions determined using the HindIII or NcoI 
restriction enzyme for human and mouse ES cells, mouse 
cortex and human IMR90 fibroblasts. We then 
determined if each CTCF-binding sequence in the hg18 
or mm9 genomes was located within a topological 
domain or within a boundary region between domains 
and calculated the distance, in bp, between the edges of 
the binding sequence and the topological region. 
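
As an illustration of this step, the sketch below places a binding sequence relative to a sorted list of domains on one chromosome and reports a distance to the nearest domain edge. It rests on several assumptions that are not taken from the database: coordinates are 0-based and half-open, the binding sequence is represented by its midpoint rather than by its two edges, and any position outside a domain is simply labeled `outside_domain` without distinguishing boundary regions from other inter-domain sequence. The function name `locate_site` is hypothetical.

```python
import bisect

def locate_site(site_start, site_end, domains):
    """Place a binding sequence relative to topological domains.

    `domains` is a non-empty list of (start, end) tuples on a single
    chromosome, sorted by start and non-overlapping (assumed).
    Returns (region_type, distance_bp), where distance_bp is measured
    from the midpoint of the binding sequence to the nearest domain edge.
    """
    mid = (site_start + site_end) // 2
    starts = [start for start, _ in domains]
    # Index of the last domain that starts at or before the midpoint.
    i = bisect.bisect_right(starts, mid) - 1
    inside = i >= 0 and mid < domains[i][1]
    region_type = "domain" if inside else "outside_domain"
    # Every domain start and end is a candidate nearest edge.
    distance = min(abs(mid - edge) for dom in domains for edge in dom)
    return region_type, distance
```

A small distance for a sequence labeled `domain` flags a site that sits just inside a domain edge, which is the kind of site the distance annotation is meant to help identify.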

Binding site motif classification 

Due to the diversity in CTCF function, it has been sug- 
gested that different functions may be conferred by differ- 
ent CTCF-DNA-binding modes, potentially involving 
different combinations of interactions with the 11 zinc 
fingers that compose CTCF's DNA-binding domain 
(1,2). Using genome-wide CTCF-binding sites determined 
in six mammalian species, Schmidt et al. (16) recently 
investigated this possibility by examining the binding site 
sequences and, agreeing with previous observations 
(14,15), delineated a multi-part CTCF-binding motif. 
They observed that, for the majority of CTCF-DNA-binding 
events, the N-terminal zinc fingers interact with 
a 14-bp-long M1 motif. Additionally, in a subset of 
binding events, the C-terminal fingers interact with a 
shorter M2 motif, creating a 34-bp-long M1+M2 
motif. In the most common arrangement of sites containing 
M1+M2 motifs, the half-site distance between M1 and 
M2 was 21 or 22 bp. In order to classify the CTCF-binding 
sites based on the type of binding event, each 
binding sequence was scanned for matches to the M1 
and M2 CTCF-binding motifs described by Schmidt 
et al. (16) and provided at 
http://www.ebi.ac.uk/~schwalie/CTCFCell2012/ using the nmscan module of 
NestedMICA (28) with a cutoff of -15. The binding sequences were then 
classified as None (no M1 motif matches), M1 
(the sequence matches the M1 motif), M1M2 (the 
sequence contains matches to the M1 and M2 motifs 
separated by a half-site distance of 12-42 bp) and 
M1M2_21_22 (the sequence contains matches to the M1 
and M2 motifs that were separated by a half-site distance 
of 21 or 22 bp). Additionally, we included the position 
weight matrices of the M1 and M2 motifs in the 
CTCFBSDB Prediction Tool, which has been described 
previously (10), allowing users to scan query sequences 
for CTCF-binding site motifs. 
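
The four classes can be expressed as a small decision rule applied to the motif matches found in a sequence. The sketch below is illustrative only and does not reproduce the scanning step: it assumes that match positions have already been produced by a separate motif scanner, that the half-site distance is the absolute difference between an M1 and an M2 match offset, and that the M1M2 window spans 12 to 42 bp. The function name `classify_site` is hypothetical.

```python
def classify_site(m1_positions, m2_positions):
    """Assign a motif class from motif match offsets within one sequence.

    `m1_positions` and `m2_positions` are lists of integer offsets of
    M1 and M2 matches (assumed to come from an external scanner).
    """
    if not m1_positions:
        return "None"
    # All pairwise half-site distances between M1 and M2 matches.
    distances = {abs(m1 - m2) for m1 in m1_positions for m2 in m2_positions}
    # Test the most specific class first, since 21 and 22 bp also fall
    # inside the broader 12-42 bp window.
    if distances & {21, 22}:
        return "M1M2_21_22"
    if any(12 <= d <= 42 for d in distances):
        return "M1M2"
    return "M1"
```

In this sketch a sequence with an M2 match but no M1 match is classed as None, following the definition of that class as having no M1 motif matches.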

Flanking gene expression 

To investigate the potential for CTCF-binding sites to 
function as insulators, CTCFBSDB includes a comparison 
of the expression of the genes flanking each 
CTCF-binding site (Figure 2). In the original version of 
the database, this comparison was a heatmap image 
comparing microarray-based gene expression profiles 
from 61 mouse and 79 human tissues (29). We have main- 
tained these microarray gene expression heatmaps in the 
updated version of the database, but present an additional 
figure containing RNA-Seq gene expression profiles 
determined in 10 human tissues (30). As the RNA-Seq 
data contain only 10 tissues, we display a column chart 
comparing the number of normalized Reads Per Kilobase 
of exon per Million mapped reads for the flanking genes of 
each CTCF-binding site. In CTCFBSDB 2.0, the 
microarray expression profiles are rendered using the 
BioHeatmap JavaScript library 
(http://code.google.com/p/systemsbiology-visualizations/), whereas the RNA-Seq 
column charts use Google Visualization APIs 
(https://developers.google.com/chart/). 
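
For readers unfamiliar with the unit, the commonly used definition of Reads Per Kilobase of exon per Million mapped reads can be written as a one-line calculation. This is a sketch of the standard formula only; the function and parameter names are hypothetical, and any additional normalization applied to the RNA-Seq dataset (30) is not represented here.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads Per Kilobase of exon per Million mapped reads.

    read_count:          reads mapped to the exons of one gene
    exon_length_bp:      summed exon length of that gene, in bp
    total_mapped_reads:  all mapped reads in the sample
    """
    # 1e9 = 1e3 (bp per kilobase) * 1e6 (reads per million).
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)
```

For example, 500 reads on a gene with 2000 bp of exon sequence in a sample of 10 million mapped reads correspond to an RPKM of 25.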

Figure 2. Gene expression profiles for genes flanking a CTCF-binding site (ENCODE_OC_hg18_MCF-7_744758). CTCFBSDB provides images 
comparing expression profiles identified using both RNA-Seq (top) and microarrays (bottom) for genes flanking the CTCF-binding site. 

Table 1. Description of fields used to annotate CTCF-binding sites 
Field name Description 
ID 

Species and build 
Location 
ENCODE 
Source 

Cell and experiment type 
Occupancy 
Occupancy% 
M1M2 Class 
ENCODE Peak location 



Additional CTCF-binding site annotation 

In addition to identifying the topological domain location 
and binding motif type, each CTCF-binding sequence in 
the database is annotated with descriptions of the binding 
site and the experiment in which the site was identified 
(Table 1). Of particular interest among these annotations 
are two fields that show the occupancy of the binding site 
that was determined in the experiment. The first of these, 
'Occupancy', provides a numeric value (i.e. read count for 
ChIP-Seq experiments or signal strength for ENCODE 
data) indicating the extent to which the binding site was 
occupied in the experiment, if available. As the values in 
the 'Occupancy' field had different scales for different 
experiments, we calculated the percentile of the occupancy 
value for each binding site within its dataset to allow for 
comparisons across experiments in the 'Occupancy%' 
field. A value of 99 in this field indicates that the 
binding site was in the top 1% of occupancy values 
within the dataset. 
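
A within-dataset percentile of this kind can be sketched as follows. The exact percentile definition used for the 'Occupancy%' field is not stated here, so this example assumes one plausible choice: the percentage of occupancy values in the same dataset that are strictly lower, rounded down to an integer. The function name `occupancy_percentiles` is hypothetical.

```python
import bisect

def occupancy_percentiles(values):
    """Return an integer percentile (0-99) for each occupancy value.

    Each percentile is the share of values in the same dataset that are
    strictly lower than the value, expressed in percent and rounded down.
    """
    ordered = sorted(values)
    n = len(ordered)
    # bisect_left counts how many values are strictly lower than v.
    return [100 * bisect.bisect_left(ordered, v) // n for v in values]
```

With this definition, the highest of 100 distinct values receives 99, and tied values receive the same percentile, so values from experiments with different raw scales become directly comparable.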

DATABASE USE AND ACCESS 

Users can access CTCFBSDB through a variety of browse 
and search options. The contents of the database can be 
browsed through three tables containing experimentally 
identified CTCF-binding sequences, topological domains 
and computationally predicted CTCF-binding sites, 
respectively. The browseable experimental binding 
sequence table contains the unique CTCFBSDB identifier, 
which links to a page containing the full database record 
of the binding sequence, and a brief description of the 
binding site. This table can be filtered by species, cell 
type and chromosome, allowing users to quickly view 
relevant binding sites. The topological domain table can 
be filtered by species, cell type and chromosome and 
displays a unique database identifier for the topological 
domain or boundary and location of the domain. Clicking 
on the domain identifier presents a list of all CTCF- 
binding sequences located within the domain sorted by 
chromosome location. The predicted CTCF-binding site 
table remains unchanged from the first release of the 
database. 

CTCFBSDB contains two options for searching the 
database. First, users can search for all binding sites in a 
species within a genomic range. Optionally, the search 
results can be restricted to binding sequences 
from a single data source. Because the large share of 
database records collected from the ENCODE project 
can cause ENCODE binding sites to overwhelm search 
results, the search can also be set to 
include all binding sequences, include only those binding 
sequences identified in the ENCODE project, or exclude 
ENCODE binding sequences. Second, for quick access to 
previously investigated binding sequences, a keyword 
search can be used to search for a particular 
CTCFBSDB identifier. 
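
The behavior of the genomic range search with its three ENCODE settings can be sketched over an in-memory list of records. This is not the database's query code: the `BindingSite` record, its field names, the `source` label "ENCODE", the identifiers used in the usage note and the `encode` option values are all hypothetical and chosen only to illustrate the filter logic.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BindingSite:
    site_id: str   # hypothetical identifier
    species: str
    chrom: str
    start: int     # 0-based, half-open coordinates (assumed)
    end: int
    source: str    # hypothetical label, e.g. "ENCODE" or a publication tag

def search_range(sites, species, chrom, start, end, encode="all"):
    """Return sites overlapping [start, end) for one species and chromosome.

    `encode` selects the ENCODE filter: "all", "only" or "exclude".
    """
    if encode not in {"all", "only", "exclude"}:
        raise ValueError(f"unknown encode filter: {encode}")
    hits = []
    for site in sites:
        if site.species != species or site.chrom != chrom:
            continue
        # Skip sites that end before the range or start after it.
        if site.end <= start or site.start >= end:
            continue
        is_encode = site.source == "ENCODE"
        if encode == "only" and not is_encode:
            continue
        if encode == "exclude" and is_encode:
            continue
        hits.append(site)
    return hits
```

For instance, calling `search_range(sites, "human", "chr1", 120, 180, encode="exclude")` on a list holding one ENCODE record and one publication-derived record in that range would return only the latter.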

Each experimental binding site in CTCFBSDB is pre- 
sented on a webpage (Figure 1) that contains the following 
five sections: (i) Description: a table presenting a description 
of the binding site, including the annotation information 
presented in Table 1; (ii) Topological Domains: a table 
presenting the domain identifier, type, location and 
distance from the binding sequence to the nearest edge 
of the domain boundary for the topological domains in 
which the binding site is located; (iii) Flanking Gene 
Expression: figures (Figure 2) comparing RNA-Seq and 
microarray expression profiles of the genes flanking the 
binding site; (iv) Overlapping CTCF-Binding Sites: a 
table containing CTCF-binding site sequences that 
overlap this sequence and (v) Genome Browser: a 
Genome Browser viewer (19) that displays the genomic 
context of the binding site, including UCSC genes, SNPs 
and custom tracks for the binding site and topological 
domains. Additionally, as methylation at CTCF-binding 
sites has been shown to impact CTCF binding (31,32), we 
display methylation tracks provided by the ENCODE 
project for the human genome (the ENC DNA Methyl track 
for hg19 and the HAIB Methyl-seq and HAIB Methyl27 
tracks for hg18) in the Genome Browser viewer, allowing 
users to quickly identify methylation sites within CTCF- 
binding sequences. By default, the flanking gene expres- 
sion figure and overlapping binding site table are hidden, 
but can quickly be displayed by selecting a clearly labeled 
box. Displaying the overlapping CTCF-binding site table 
automatically adds a custom track to the Genome 
Browser containing these sites, allowing for visualization 
of the extent to which the binding site sequence overlaps 
other CTCF-binding sequences identified in other cell 
types or experiments. 

DISCUSSION AND FUTURE DIRECTIONS 

Updates made in version 2.0 of the CTCFBSDB reflect 
significant advances in both the number of known 
CTCF-binding sites and the understanding of CTCF function. In 
addition to a 250-fold increase in the binding site 
sequences included in CTCFBSDB, the database now integrates 
new data describing the details of the binding site 
(i.e. binding site occupancy, motif match type and location 
within topological domain), which can potentially be used 
to investigate the function of specific binding sites. With 
the large number of experiments that have determined 
CTCF-binding sites, it is likely that the majority of 
binding sites in the mouse and human genomes have 
already been identified. A next step in understanding the 
function of CTCF is determining if and how specific 
features of these binding sites allow CTCF to perform 
its diverse functions. The CTCFBSDB has the potential 
to be particularly useful to this effort, as it may not require 
the identification of new binding sites, but, instead, can be 
based on analysis of known binding sites. For example, 
data contained in the database can be used to compare 
binding sites located at the boundaries of topological 
domains with those in the domain centers to determine 
the characteristics that distinguish these types of binding 
sites. 

In the future, the utility of the CTCFBSDB can be 
improved in several ways. The results of Hi-C and 
similar experiments will continue to increase understand- 
ing of the 3D structure and the role that CTCF plays in 
organizing this structure. It is likely that some data 
generated by these studies can be integrated with the 
CTCFBSDB, similar to how we have included the loca- 
tions of topological domains. Additionally, while CTCF 
has been shown to interact with several other proteins (9), 
such as cohesin (33-35), the interactions between CTCF 
and these other proteins are not completely understood. 
As more binding sites of cohesin or other proteins that 
interact with CTCF are identified, these binding sites 
can be integrated into the CTCFBSDB, adding new data 
that can be used to determine the function of a binding 
site. 